Abstract

Given a matrix M of low rank, we consider the problem of reconstructing it from noisy
observations of a small, random subset of its entries. The problem arises in a variety of applications,
from collaborative filtering (the 'Netflix problem') to structure-from-motion and positioning. We
study a low complexity algorithm introduced in [KMO09], based on a combination of spectral
techniques and manifold optimization, that we call here OptSpace. We prove performance
guarantees that are order-optimal in a number of circumstances.

1 Introduction 

Spectral techniques are an authentic workhorse in machine learning, statistics, numerical analysis,
and signal processing. Given a matrix M, its largest singular values (and the associated singular
vectors) 'explain' the most significant correlations in the underlying data source. A low-rank
approximation of M can further be used for low-complexity implementations of a number of linear
algebra algorithms [FKV04].

In many practical circumstances we have access only to a sparse subset of the entries of an $m \times n$
matrix M. It has recently been discovered that, if the matrix M has rank r, and unless it is too
'structured', a small random subset of its entries allows one to reconstruct it exactly. This result was
first proved by Candes and Recht [CR08] by analyzing a convex relaxation introduced by Fazel [Faz02].
A tighter analysis of the same convex relaxation was carried out in [CT09]. A number of iterative
schemes to solve the convex optimization problem appeared soon thereafter [CCS08, MGC09, TY09]
(see also [WGRM09] for a generalization).

In an alternative line of work, the authors of [KMO09] attacked the same problem using a
combination of spectral techniques and manifold optimization: we will refer to their algorithm
as OptSpace. OptSpace is intrinsically of low complexity, the most complex operation being
computing r singular values of a sparse $m \times n$ matrix. The performance guarantees proved in [KMO09]
are comparable with the information theoretic lower bound: roughly $n r \max\{r, \log n\}$ random entries
are needed to reconstruct M exactly (here we assume m of order n). A related approach was also
developed in [LB09], although without performance guarantees for matrix completion.

The above results crucially rely on the assumption that M is exactly a rank r matrix. For many
applications of interest, this assumption is unrealistic and it is therefore important to investigate their
robustness. Can the above approaches be generalized when the underlying data is 'well approximated'
by a rank r matrix? This question was addressed in [CP09] within the convex relaxation approach
of [CR08]. The present paper proves a similar robustness result for OptSpace. Remarkably, the
guarantees we obtain are order-optimal in a variety of circumstances, and improve over the analogous
results of [CP09].



1.1 Model Definition 

Let M be an $m \times n$ matrix of rank r, that is

$$ M = U \Sigma V^T , \tag{1} $$

where U has dimensions $m \times r$, V has dimensions $n \times r$, and $\Sigma$ is a diagonal $r \times r$ matrix. We assume
that each entry of M is perturbed, thus producing an 'approximately' low-rank matrix N, with

$$ N_{ij} = M_{ij} + Z_{ij} , $$

where the matrix Z will be assumed to be 'small' in an appropriate sense.

Out of the $m \times n$ entries of N, a subset $E \subseteq [m] \times [n]$ is revealed. We let $N^E$ be the $m \times n$ matrix
that contains the revealed entries of N, and is filled with 0's in the other positions:

$$ N^E_{ij} = \begin{cases} N_{ij} & \text{if } (i,j) \in E , \\ 0 & \text{otherwise.} \end{cases} \tag{2} $$

The set E will be uniformly random given its size |E|.
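
As a concrete illustration of this observation model, the following NumPy sketch draws a random rank-r matrix, perturbs it, and reveals a uniformly random subset of entries of prescribed size. The function name, the Gaussian factors, and the Gaussian noise are assumptions made for the example only; the text merely requires Z to be 'small'.

```python
import numpy as np

def sample_noisy_observations(m, n, r, num_revealed, noise_std, seed=0):
    """Draw M of rank r, N = M + Z, and the revealed matrix N^E with |E| = num_revealed."""
    rng = np.random.default_rng(seed)
    # Hypothetical choice: Gaussian factors, so M has rank r with probability one.
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((n, r))
    M = U @ V.T
    # Hypothetical choice: i.i.d. Gaussian perturbation Z.
    Z = noise_std * rng.standard_normal((m, n))
    N = M + Z
    # E: a uniformly random subset of [m] x [n] of the given size.
    flat_indices = rng.choice(m * n, size=num_revealed, replace=False)
    mask = np.zeros(m * n, dtype=bool)
    mask[flat_indices] = True
    mask = mask.reshape(m, n)
    # Revealed entries are kept, all other positions are filled with zeros.
    N_E = np.where(mask, N, 0.0)
    return M, N, mask, N_E
```

The boolean `mask` plays the role of the set E; keeping it alongside `N_E` avoids confusing a revealed entry that happens to be zero with an unrevealed one.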



1.2 Algorithm 

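For the reader's convenience, we recall the algorithm introduced in [KMO09], which we will analyze
here. The basic idea is to minimize the cost function F(X,Y), defined by

$$ F(X,Y) \equiv \min_{S \in \mathbb{R}^{r \times r}} \mathcal{F}(X,Y,S) , \tag{3} $$
$$ \mathcal{F}(X,Y,S) \equiv \frac{1}{2} \sum_{(i,j) \in E} \big( N_{ij} - (X S Y^T)_{ij} \big)^2 . \tag{4} $$

Here $X \in \mathbb{R}^{m \times r}$, $Y \in \mathbb{R}^{n \times r}$ are orthogonal matrices, normalized by $X^T X = m\mathbf{1}$, $Y^T Y = n\mathbf{1}$.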
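
For fixed X and Y the inner minimization over S is an ordinary linear least-squares problem in the $r^2$ entries of S. The sketch below evaluates the cost this way; the function name `cost_F` and the use of `numpy.linalg.lstsq` are choices made for illustration, not a description of the authors' implementation.

```python
import numpy as np

def cost_F(X, Y, N, mask):
    """Return (F(X, Y), S) where S minimizes the squared error over revealed entries."""
    rows, cols = np.nonzero(mask)
    r = X.shape[1]
    # Each revealed entry (i, j) gives one linear equation in the entries of S:
    # (X S Y^T)_ij = sum_{a,b} X_ia * S_ab * Y_jb.
    design = (X[rows][:, :, None] * Y[cols][:, None, :]).reshape(len(rows), r * r)
    target = N[rows, cols]
    s_flat, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ s_flat
    return 0.5 * float(residual @ residual), s_flat.reshape(r, r)
```

When X and Y span the column and row spaces of a noiseless, fully revealed rank-r matrix, the returned cost is zero up to rounding and $X S Y^T$ reproduces the matrix.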

Minimizing F(X,Y) is an a priori difficult task, since F is a non-convex function. The key
insight is that the singular value decomposition (SVD) of $N^E$ provides an excellent initial guess,
and that the minimum can be found with high probability by standard gradient descent after this
initialization. Two caveats must be added to this description: (1) In general the matrix $N^E$ must be
'trimmed' to eliminate over-represented rows and columns; (2) For technical reasons, we consider a
slightly modified cost function to be denoted by $\widetilde F(X,Y)$.

OptSpace( matrix $N^E$ )

1: Trim $N^E$, and let $\widetilde N^E$ be the output;

2: Compute the rank-r projection of $\widetilde N^E$, $\mathsf{T}_r(\widetilde N^E) = X_0 S_0 Y_0^T$;

3: Minimize $\widetilde F(X,Y)$ through gradient descent, with initial condition $(X_0, Y_0)$.

We may note here that the rank of the matrix M, if not known, can be reliably estimated from $\widetilde N^E$.
The various steps of the above algorithm are defined as follows.

Trimming. We say that a row is 'over-represented' if it contains more than $2|E|/m$ revealed
entries (i.e. more than twice the average number of revealed entries). Analogously, a column is
over-represented if it contains more than $2|E|/n$ revealed entries. The trimmed matrix $\widetilde N^E$ is obtained
from $N^E$ by setting to 0 over-represented rows and columns.
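
The trimming step can be sketched in a few lines of NumPy. The function name `trim` and the boolean-mask representation of E are assumptions of this example; row and column counts are both taken from the original set E, which is one reading of the definition above.

```python
import numpy as np

def trim(N_E, mask):
    """Zero out rows with more than 2|E|/m and columns with more than 2|E|/n revealed entries."""
    m, n = mask.shape
    num_revealed = mask.sum()
    # Degrees are counted on the original revealed set E.
    heavy_rows = mask.sum(axis=1) > 2 * num_revealed / m
    heavy_cols = mask.sum(axis=0) > 2 * num_revealed / n
    trimmed = N_E.copy()
    trimmed[heavy_rows, :] = 0.0
    trimmed[:, heavy_cols] = 0.0
    return trimmed, heavy_rows, heavy_cols
```

For example, in a $4 \times 4$ matrix with five revealed entries, a fully revealed row has 4 > 2.5 revealed entries and is set to zero, while a column with two revealed entries is kept.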

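Rank-r projection. Let

$$ \widetilde N^E = \sum_{i=1}^{\min(m,n)} \sigma_i x_i y_i^T $$

be the singular value decomposition of $\widetilde N^E$, with singular values $\sigma_1 \ge \sigma_2 \ge \dots$. We then define

$$ \mathsf{T}_r(\widetilde N^E) = \frac{mn}{|E|} \sum_{i=1}^{r} \sigma_i x_i y_i^T . \tag{5} $$

Apart from an overall normalization, $\mathsf{T}_r(\widetilde N^E)$ is the best rank-r approximation to $\widetilde N^E$ in Frobenius
norm.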
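
A minimal sketch of the rank-r projection follows. It assumes the overall normalization is the factor $mn/|E|$, which compensates for the fraction of entries that are revealed; the function name is hypothetical, and a dense SVD is used for simplicity where a sparse, truncated SVD would be preferable at scale.

```python
import numpy as np

def rank_r_projection(trimmed, r, num_revealed):
    """Keep the top r singular triplets of the trimmed matrix and rescale by mn/|E|."""
    m, n = trimmed.shape
    X, sigma, Yt = np.linalg.svd(trimmed, full_matrices=False)
    scale = m * n / num_revealed  # assumed normalization
    # Scale the first r left singular vectors by their singular values, then recombine.
    return scale * (X[:, :r] * sigma[:r]) @ Yt[:r, :]
```

When every entry is revealed the scale factor equals one, and projecting an exactly rank-r matrix returns the matrix itself.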

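Minimization. The modified cost function $\widetilde F$ is defined as

$$ \widetilde F(X,Y) = F(X,Y) + \rho\, G(X,Y) \tag{6} $$
$$ \equiv F(X,Y) + \rho \sum_{i=1}^{m} G_1\Big( \frac{\|X^{(i)}\|^2}{3 \mu_0 r} \Big) + \rho \sum_{j=1}^{n} G_1\Big( \frac{\|Y^{(j)}\|^2}{3 \mu_0 r} \Big) , \tag{7} $$

where $X^{(i)}$ denotes the i-th row of X, and $Y^{(j)}$ the j-th row of Y. The function $G_1 : \mathbb{R}^+ \to \mathbb{R}$ is
such that $G_1(z) = 0$ if $z \le 1$ and $G_1(z) = e^{(z-1)^2} - 1$ otherwise. Further, we can choose $\rho = \Theta(n\epsilon)$.

Let us stress that the regularization term is mainly introduced for our proof technique to work
(and a broad family of functions $G_1$ would work as well). In numerical experiments we did not find
any performance loss in setting $\rho = 0$.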
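
The regularization term is simple to evaluate. The sketch below assumes that $G_1$ is applied to the squared row norms of X and Y divided by $3\mu_0 r$; that scaling, like the function names, should be read as an assumption of the example. It is consistent with the value $e^{1/9} - 1$ that appears later in the proof for a row of squared norm $4\mu_0 r$.

```python
import math

def g1(z):
    """G_1(z) = 0 for z <= 1 and exp((z - 1)^2) - 1 otherwise."""
    return 0.0 if z <= 1.0 else math.exp((z - 1.0) ** 2) - 1.0

def regularizer(X, Y, mu0, r):
    """G(X, Y): sum of G_1 over the rescaled squared row norms of X and Y."""
    scale = 3.0 * mu0 * r  # assumed normalization of the squared row norms
    total = 0.0
    for factor in (X, Y):
        for row in factor:
            total += g1(sum(v * v for v in row) / scale)
    return total
```

Rows whose squared norm is at most $3\mu_0 r$ contribute nothing, so the penalty only activates when a row of X or Y becomes much larger than an incoherent factor would allow.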

One important feature of OptSpace is that F(X,Y) and $\widetilde F(X,Y)$ are regarded as functions
of the r-dimensional subspaces of $\mathbb{R}^m$ and $\mathbb{R}^n$ generated (respectively) by the columns of X and
Y. This interpretation is justified by the fact that F(X,Y) = F(XA, YB) for any two orthogonal
matrices $A, B \in \mathbb{R}^{r \times r}$ (the same property holds for $\widetilde F$). The set of r-dimensional subspaces of $\mathbb{R}^m$
is a differentiable Riemannian manifold $\mathsf{G}(m,r)$ (the Grassmann manifold). The gradient descent
algorithm is applied to the function $\widetilde F : \mathsf{M}(m,n) \equiv \mathsf{G}(m,r) \times \mathsf{G}(n,r) \to \mathbb{R}$. For further details on
optimization by gradient descent on matrix manifolds we refer to [EAS99, AMS08].



1.3 Main Results 

Our first result shows that, in great generality, the rank-r projection of $\widetilde N^E$ provides a reasonable
approximation of M. Throughout this paper, without loss of generality, we assume $\alpha \equiv m/n \ge 1$.

Theorem 1.1. Let N = M + Z, where M has rank r, and assume that the subset of revealed entries
$E \subseteq [m] \times [n]$ is uniformly random with size |E|. Then there exist numerical constants C and C'
such that

$$ \frac{1}{\sqrt{mn}} \| M - \mathsf{T}_r(\widetilde N^E) \|_F \le C M_{\max} \Big( \frac{n r \alpha^{3/2}}{|E|} \Big)^{1/2} + C' \frac{n \sqrt{r \alpha}}{|E|} \| \widetilde Z^E \|_2 , $$

with probability larger than $1 - 1/n^3$.

Projection onto rank-r matrices through SVD is standard (although trimming is crucial
for achieving the above guarantee). The key point here is that a much better approximation is
obtained by minimizing the cost $\widetilde F(X,Y)$ (step 3 in the pseudocode above), provided M satisfies
an appropriate incoherence condition. Let $M = U \Sigma V^T$ be a low rank matrix, and assume, without
loss of generality, $U^T U = m\mathbf{1}$ and $V^T V = n\mathbf{1}$. We say that M is $(\mu_0, \mu_1)$-incoherent if the following
conditions hold.

A1. For all $i \in [m]$, $j \in [n]$ we have $\sum_{k=1}^{r} U_{ik}^2 \le \mu_0 r$, $\sum_{k=1}^{r} V_{jk}^2 \le \mu_0 r$.
A2. There exists $\mu_1$ such that $| \sum_{k=1}^{r} U_{ik} (\Sigma_k / \Sigma_1) V_{jk} | \le \mu_1 r^{1/2}$.
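
To get a feeling for conditions A1 and A2, one can compute the smallest constants that satisfy them for a given matrix. The sketch below does this from an SVD, with factors rescaled so that $U^T U = m\mathbf{1}$ and $V^T V = n\mathbf{1}$ and with $\Sigma_1$ taken to be the largest singular value. The function name is hypothetical and the code is only meant as an illustration of the definitions.

```python
import numpy as np

def incoherence_parameters(M, r):
    """Smallest (mu0, mu1) for which A1 and A2 hold for the top-r SVD factors of M."""
    m, n = M.shape
    P, s, Qt = np.linalg.svd(M, full_matrices=False)
    U = np.sqrt(m) * P[:, :r]        # columns have squared norm m
    V = np.sqrt(n) * Qt[:r, :].T     # columns have squared norm n
    # A1: every squared row norm of U and V is at most mu0 * r.
    mu0 = max((U ** 2).sum(axis=1).max(), (V ** 2).sum(axis=1).max()) / r
    # A2: entries of U diag(Sigma_k / Sigma_1) V^T are at most mu1 * sqrt(r) in absolute value.
    mu1 = np.abs((U * (s[:r] / s[0])) @ V.T).max() / np.sqrt(r)
    return mu0, mu1
```

The all-ones matrix is maximally spread out and gives $\mu_0 = \mu_1 = 1$, whereas a matrix with a single nonzero entry concentrates all its mass in one row and gives $\mu_0 = m$.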

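Theorem 1.2. Let N = M + Z, where M is a $(\mu_0, \mu_1)$-incoherent matrix of rank r, and assume
that the subset of revealed entries $E \subseteq [m] \times [n]$ is uniformly random with size |E|. Further, let
$\Sigma_{\min} = \Sigma_1 \le \dots \le \Sigma_r = \Sigma_{\max}$ with $\Sigma_{\max}/\Sigma_{\min} \equiv \kappa$. Let $\widehat M$ be the output of OptSpace on input
$N^E$. Then there exist numerical constants C and C' such that if

$$ |E| \ge C n \sqrt{\alpha}\, \kappa^2 \max\big\{ \mu_0 r \sqrt{\alpha} \log n \,;\; \mu_0^2 r^2 \alpha \kappa^4 \,;\; \mu_1^2 r^2 \alpha \kappa^4 \big\} , $$

then, with probability at least $1 - 1/n^3$,

$$ \frac{1}{\sqrt{mn}} \| \widehat M - M \|_F \le C' \kappa^2 \frac{n \sqrt{\alpha r}}{|E|} \| Z^E \|_2 , \tag{8} $$

provided that the right-hand side is smaller than $\Sigma_{\min}$.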

Apart from capturing the effect of additive noise, these two theorems improve over the work of
[KMO09] even in the noiseless case. Indeed they provide quantitative bounds in finite dimensions,
while the results of [KMO09] were only asymptotic.



1.4 Noise Models 

In order to make sense of the above results, it is convenient to consider a couple of simple models
for the noise matrix Z:

Independent entries model. We assume that Z's entries are independent random variables, with
zero mean $\mathbb{E}\{Z_{ij}\} = 0$ and sub-gaussian tails. The latter means that

$$ \mathbb{P}\{ |Z_{ij}| \ge x \} \le 2 e^{-\frac{x^2}{2\sigma^2}} , $$

for some bounded constant $\sigma^2$.

Worst case model. In this model Z is arbitrary, but we have a uniform bound on the size of its
entries: $|Z_{ij}| \le Z_{\max}$.

The basic parameter entering our main results is the operator norm of $\widetilde Z^E$, which is bounded as
follows.

Theorem 1.3. If Z is a random matrix drawn according to the independent entries model, then
there is a constant C such that,

$$ \| \widetilde Z^E \|_2 \le C \sigma \Big( \frac{\sqrt{\alpha}\, |E| \log |E|}{n} \Big)^{1/2} , $$

with probability at least $1 - 1/n^3$.

Figure 1: Root mean square error achieved by OptSpace for reconstructing a random rank-2 matrix, as a
function of the number of observed entries |E|, and of the number of line minimizations. The performance of
nuclear norm minimization and an information theory lower bound are also shown.



If Z is a matrix from the worst case model, then

$$ \| \widetilde Z^E \|_2 \le \frac{2 |E|}{n \sqrt{\alpha}} Z_{\max} , \tag{9} $$

for any realization of E.
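
The worst case bound is deterministic, so it can be checked directly on any instance. The sketch below assumes the bound has the form $\|\widetilde Z^E\|_2 \le 2|E| Z_{\max}/(n\sqrt{\alpha})$: after trimming, every row has at most $2|E|/m$ and every column at most $2|E|/n$ nonzero entries, each bounded by $Z_{\max}$. The function name and the uniform noise are assumptions of the example.

```python
import numpy as np

def trimmed_noise_norm_and_bound(m, n, num_revealed, z_max, seed=0):
    """Return (operator norm of the trimmed noise, assumed worst case bound)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(m * n, dtype=bool)
    mask[rng.choice(m * n, size=num_revealed, replace=False)] = True
    mask = mask.reshape(m, n)
    # Hypothetical bounded noise with |Z_ij| <= z_max.
    Z = rng.uniform(-z_max, z_max, size=(m, n))
    Z_E = np.where(mask, Z, 0.0)
    # Trimming: zero out over-represented rows and columns.
    heavy_rows = mask.sum(axis=1) > 2 * num_revealed / m
    heavy_cols = mask.sum(axis=0) > 2 * num_revealed / n
    Z_E[heavy_rows, :] = 0.0
    Z_E[:, heavy_cols] = 0.0
    alpha = m / n
    bound = 2 * num_revealed / (n * np.sqrt(alpha)) * z_max
    return float(np.linalg.norm(Z_E, 2)), float(bound)
```

For random bounded noise the observed norm is typically far below the bound, which is only tight for adversarially aligned perturbations.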

Note that for $|E| = \Omega(n \log n)$, no row or column is over-represented with high probability.
It follows that in the regime of |E| for which the conditions of Theorem 1.2 are satisfied, we have
$Z^E = \widetilde Z^E$. Then, among other things, this result implies that for the independent entries
model the right-hand side of our error estimate, Eq. (8), is with high probability smaller than
$\Sigma_{\min}$, if $|E| \ge C r \alpha^{3/2} n \log n\, \kappa^4 (\sigma/\Sigma_{\min})^2$. For the worst case model, the same statement is true if
$Z_{\max} \le \Sigma_{\min} / (C \sqrt{r}\, \kappa^2)$.

1.5 Comparison with Related Work 

Let us begin by mentioning that a statement analogous to our preliminary Theorem 1.1 was proved
in [AM07]. Our result however applies to any number of revealed entries, while the one of [AM07]
requires $|E| \ge (8 \log n)^4 n$ (which for $n \le 5 \cdot 10^8$ is larger than $n^2$).

As for Theorem 1.2, we will mainly compare our algorithm with the convex relaxation approach
recently analyzed in [CP09]. Our basic setting is indeed the same, while the algorithms are rather
different.

Figure 1 compares the average root mean square error for the two algorithms as a function of |E|.
Here M is a random rank r = 2 matrix of dimension m = n = 600, generated by letting $M = \widetilde U \widetilde V^T$
with $\widetilde U_{ij}, \widetilde V_{ij}$ i.i.d. $N(0, 20/\sqrt{n})$. The noise is distributed according to the independent noise model
with $Z_{ij} \sim N(0,1)$. This example is taken from Figure 2 in [CP09], from which we took the data for
the convex relaxation approach, as well as the information theory lower bound. After one iteration,
OptSpace has a smaller root mean square error than [CP09], and in about 10 iterations it becomes
indistinguishable from the information theory lower bound.

Next let us compare our main result with the performance guarantee of Theorem 7 in [CP09].
Let us stress that we require some bound on the condition number $\kappa$, while the analyses of [CP09]
and [CT09] require a stronger incoherence assumption. As far as the error bound is concerned, [CP09]
proved



$$ \frac{1}{\sqrt{mn}} \| \widehat M - M \|_F \le 7 \sqrt{\frac{n}{|E|}}\, \| Z^E \|_F + \frac{2}{n \sqrt{\alpha}} \| Z^E \|_F . \tag{10} $$




(The constant in front of the first term is in fact slightly smaller than 7 in [CP09], but in any case
larger than $4\sqrt{2}$.)

Theorem 1.2 improves over this result in several respects: (1) We do not have the second term on
the right hand side of (10), which actually increases with the number of observed entries; (2) Our error
decreases as $n/|E|$ rather than $(n/|E|)^{1/2}$; (3) The noise enters Theorem 1.2 through the operator
norm $\|Z^E\|_2$ instead of its Frobenius norm $\|Z^E\|_F \ge \|Z^E\|_2$. For E uniformly random, one expects
$\|Z^E\|_F$ to be roughly of order $\|Z^E\|_2 \sqrt{n}$. For instance, within the independent entries model with
bounded variance $\sigma$, $\|Z^E\|_F = \Theta(\sqrt{|E|})$ while $\|Z^E\|_2$ is of order $\sqrt{|E|/n}$ (up to logarithmic terms).



2 Some Notations 



The matrix M to be reconstructed takes the form (1) where $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$. We write
$U = [u_1, u_2, \dots, u_r]$ and $V = [v_1, v_2, \dots, v_r]$ for the columns of the two factors, with $\|u_i\| = \sqrt{m}$,
$\|v_i\| = \sqrt{n}$, and $u_i^T u_j = 0$, $v_i^T v_j = 0$ for $i \ne j$ (there is no loss of generality in this, since
normalizations can be absorbed by redefining $\Sigma$).

We shall write $\Sigma = \mathrm{diag}(\Sigma_1, \dots, \Sigma_r)$ with $\Sigma_1 \ge \Sigma_2 \ge \dots \ge \Sigma_r > 0$. The maximum and minimum
singular values will also be denoted by $\Sigma_{\max} = \Sigma_1$ and $\Sigma_{\min} = \Sigma_r$. Further, the maximum size of an
entry of M is $M_{\max} \equiv \max_{ij} |M_{ij}|$.

Probability is taken with respect to the uniformly random subset $E \subseteq [m] \times [n]$ given |E| and
(possibly) the noise matrix Z. Define $\epsilon \equiv |E|/\sqrt{mn}$. In the case when m = n, $\epsilon$ corresponds to the
average number of revealed entries per row or column. Then it is convenient to work with a model
in which each entry is revealed independently with probability $\epsilon/\sqrt{mn}$. Since, with high probability
$|E| \in [\epsilon \sqrt{\alpha}\, n - A \sqrt{n \log n},\; \epsilon \sqrt{\alpha}\, n + A \sqrt{n \log n}]$, any guarantee on the algorithm performances that
holds within one model, holds within the other model as well if we allow for a vanishing shift in $\epsilon$.
We will use C, C' etc. to denote universal numerical constants.

Given a vector $x \in \mathbb{R}^n$, $\|x\|$ will denote its Euclidean norm. For a matrix $X \in \mathbb{R}^{n \times n'}$, $\|X\|_F$ is its
Frobenius norm, and $\|X\|_2$ its operator norm (i.e. $\|X\|_2 = \sup_{u \ne 0} \|Xu\| / \|u\|$). The standard scalar
product between vectors or matrices will sometimes be indicated by $\langle x, y \rangle$ or $\langle X, Y \rangle$, respectively.
Finally, we use the standard combinatorics notation $[N] = \{1, 2, \dots, N\}$ to denote the set of first N
integers.



3 Proof of Theorem 1.1

As explained in the introduction, the crucial idea is to consider the singular value decomposition of
the trimmed matrix $\widetilde N^E$ instead of the original matrix $N^E$. Apart from a trivial rescaling, these
singular values are close to the ones of the original matrix M.

Lemma 3.1. There exists a numerical constant C such that, with probability greater than $1 - 1/n^3$,

$$ \Big| \frac{\sigma_q}{\epsilon} - \Sigma_q \Big| \le C M_{\max} \sqrt{\frac{\alpha}{\epsilon}} + \frac{1}{\epsilon} \| \widetilde Z^E \|_2 , \tag{11} $$

where it is understood that $\Sigma_q = 0$ for q > r.



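Proof. For any matrix A, let $\sigma_q(A)$ denote the q-th singular value of A. Then, $\sigma_q(A + B) \le \sigma_q(A) + \sigma_1(B)$,
whence

$$ \Big| \frac{\sigma_q}{\epsilon} - \Sigma_q \Big| \le \Big| \frac{\sigma_q(\widetilde M^E)}{\epsilon} - \Sigma_q \Big| + \frac{\sigma_1(\widetilde Z^E)}{\epsilon} \le C M_{\max} \sqrt{\frac{\alpha}{\epsilon}} + \frac{1}{\epsilon} \| \widetilde Z^E \|_2 , $$

where the second inequality follows from the next Lemma as shown in [KMO09].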

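Lemma 3.2 (Keshavan, Montanari, Oh, 2009). There exists a numerical constant C such that, with
probability larger than $1 - 1/n^3$,

$$ \frac{1}{\sqrt{mn}} \Big\| M - \frac{\sqrt{mn}}{\epsilon} \widetilde M^E \Big\|_2 \le C M_{\max} \sqrt{\frac{\alpha}{\epsilon}} . \tag{12} $$

□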



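We will now prove Theorem 1.1.

Proof. (Theorem 1.1) For any matrix A of rank at most 2r, $\|A\|_F \le \sqrt{2r}\, \|A\|_2$, whence

$$ \begin{aligned}
\frac{1}{\sqrt{mn}} \| M - \mathsf{T}_r(\widetilde N^E) \|_F
&\le \frac{\sqrt{2r}}{\sqrt{mn}} \Big\| M - \frac{\sqrt{mn}}{\epsilon} \sum_{i \le r} \sigma_i x_i y_i^T \Big\|_2 \\
&= \frac{\sqrt{2r}}{\sqrt{mn}} \Big\| M - \frac{\sqrt{mn}}{\epsilon} \Big( \widetilde N^E - \sum_{i \ge r+1} \sigma_i x_i y_i^T \Big) \Big\|_2 \\
&\le \frac{\sqrt{2r}}{\sqrt{mn}} \Big( \Big\| M - \frac{\sqrt{mn}}{\epsilon} \widetilde M^E \Big\|_2 + \frac{\sqrt{mn}}{\epsilon} \| \widetilde Z^E \|_2 + \frac{\sqrt{mn}}{\epsilon} \sigma_{r+1} \Big) \\
&\le 2 C M_{\max} \sqrt{\frac{2 \alpha r}{\epsilon}} + \frac{2 \sqrt{2r}}{\epsilon} \| \widetilde Z^E \|_2 \\
&\le C' M_{\max} \Big( \frac{n r \alpha^{3/2}}{|E|} \Big)^{1/2} + 2 \sqrt{2}\, \frac{n \sqrt{r \alpha}}{|E|} \| \widetilde Z^E \|_2 .
\end{aligned} $$

Here the fourth line uses Lemma 3.2 and Lemma 3.1 with q = r + 1, and the last line uses $\epsilon = |E|/(n \sqrt{\alpha})$.
This proves our claim.

□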
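
The two elementary facts used in this section, namely $\|A\|_F \le \sqrt{2r}\,\|A\|_2$ for matrices of rank at most 2r and the singular value perturbation inequality $\sigma_q(A + B) \le \sigma_q(A) + \sigma_1(B)$, are easy to check numerically. The following sketch does so on random matrices; the function name and the Gaussian test matrices are assumptions of the example.

```python
import numpy as np

def check_norm_facts(m, n, r, seed=0):
    """Return (||A||_F, sqrt(2r) * ||A||_2, Weyl gaps) for a random A of rank at most 2r."""
    rng = np.random.default_rng(seed)
    # A has rank at most 2r by construction (requires 2r <= min(m, n)).
    A = rng.standard_normal((m, 2 * r)) @ rng.standard_normal((2 * r, n))
    B = rng.standard_normal((m, n))
    frobenius = np.linalg.norm(A, "fro")
    operator = np.linalg.norm(A, 2)
    singular_values = lambda X: np.linalg.svd(X, compute_uv=False)
    # Entry q of the gap is sigma_q(A) + sigma_1(B) - sigma_q(A + B), which should be >= 0.
    weyl_gap = singular_values(A) + singular_values(B)[0] - singular_values(A + B)
    return float(frobenius), float(np.sqrt(2 * r) * operator), weyl_gap
```

The first inequality is what turns an operator norm estimate into the Frobenius norm bound of Theorem 1.1, at the price of the factor $\sqrt{2r}$.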



4 Proof of Theorem 1.2

Recall that the cost function is defined over the Riemannian manifold $\mathsf{M}(m,n) \equiv \mathsf{G}(m,r) \times \mathsf{G}(n,r)$.
The proof of Theorem 1.2 consists in controlling the behavior of F in a neighborhood of $\mathbf{u} = (U, V)$
(the point corresponding to the matrix M to be reconstructed). Throughout the proof we let $\mathcal{K}(\mu)$
be the set of matrix couples $(X, Y) \in \mathbb{R}^{m \times r} \times \mathbb{R}^{n \times r}$ such that $\|X^{(i)}\|^2 \le \mu r$, $\|Y^{(j)}\|^2 \le \mu r$ for all i, j.

4.1 Preliminary Remarks and Definitions

Given $\mathbf{x}_1 = (X_1, Y_1)$ and $\mathbf{x}_2 = (X_2, Y_2) \in \mathsf{M}(m,n)$, two points on this manifold, their distance is
defined as $d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{d(X_1, X_2)^2 + d(Y_1, Y_2)^2}$, where, letting $(\cos\theta_1, \dots, \cos\theta_r)$ be the singular
values of $X_1^T X_2 / m$,

$$ d(X_1, X_2) = \| \theta \|_2 . $$
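
The distance between two subspaces defined through principal angles can be computed directly from an SVD. The sketch below assumes the normalization $X^T X = m\mathbf{1}$ used throughout, so that the singular values of $X_1^T X_2 / m$ are cosines of the principal angles; the function name is hypothetical.

```python
import numpy as np

def grassmann_distance(X1, X2):
    """Euclidean norm of the vector of principal angles between span(X1) and span(X2)."""
    m = X1.shape[0]
    cosines = np.linalg.svd(X1.T @ X2 / m, compute_uv=False)
    # Clip to [-1, 1] to guard against rounding before taking the arccosine.
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(theta))
```

The distance depends only on the subspaces: replacing $X_2$ by $X_2 A$ for an orthogonal matrix A leaves the singular values, and hence the distance, unchanged.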

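Given S achieving the minimum in Eq. (3), it is also convenient to introduce the notations

$$ d_-(\mathbf{x}, \mathbf{u}) \equiv \sqrt{ \Sigma_{\min}^2\, d(\mathbf{x}, \mathbf{u})^2 + \| S - \Sigma \|_F^2 } , $$

$$ d_+(\mathbf{x}, \mathbf{u}) \equiv \sqrt{ \Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2 + \| S - \Sigma \|_F^2 } . $$

4.2 Auxiliary Lemmas and Proof of Theorem 1.2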

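The proof is based on the following two lemmas that generalize and sharpen analogous bounds in
[KMO09].

Lemma 4.1. There exist numerical constants $C_0, C_1, C_2$ such that the following happens. Assume
$\epsilon \ge C_0 \mu_0 r \sqrt{\alpha} \max\{ \log n \,;\, \mu_0 r \sqrt{\alpha} (\Sigma_{\min}/\Sigma_{\max})^4 \}$ and $\delta \le \Sigma_{\min}/(C_0 \Sigma_{\max})$. Then,

$$ F(\mathbf{x}) - F(\mathbf{u}) \ge C_1 n \epsilon \sqrt{\alpha}\, d_-(\mathbf{x}, \mathbf{u})^2 - C_1 n \sqrt{r \alpha}\, \| Z^E \|_2\, d_+(\mathbf{x}, \mathbf{u}) , \tag{13} $$
$$ F(\mathbf{x}) - F(\mathbf{u}) \le C_2 n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2 + C_2 n \sqrt{r \alpha}\, \| Z^E \|_2\, d_+(\mathbf{x}, \mathbf{u}) , \tag{14} $$

for all $\mathbf{x} \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that $d(\mathbf{x}, \mathbf{u}) \le \delta$, with probability at least $1 - 1/n^4$. Here $S \in \mathbb{R}^{r \times r}$
is the matrix realizing the minimum in Eq. (3).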

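Corollary 4.2. There exists a constant C such that, under the hypotheses of Lemma 4.1,

$$ \| S - \Sigma \|_F \le C \Sigma_{\max}\, d(\mathbf{x}, \mathbf{u}) + C \frac{\sqrt{r}}{\epsilon} \| Z^E \|_2 . $$

Further, for an appropriate choice of the constants in Lemma 4.1, we have

$$ \sigma_{\max}(S) \le 2 \Sigma_{\max} + C \frac{\sqrt{r}}{\epsilon} \| Z^E \|_2 , \tag{15} $$

$$ \sigma_{\min}(S) \ge \frac{1}{2} \Sigma_{\min} - C \frac{\sqrt{r}}{\epsilon} \| Z^E \|_2 . \tag{16} $$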

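Lemma 4.3. There exist numerical constants $C_0, C_1, C_2$ such that the following happens. Assume
$\epsilon \ge C_0 \mu_0 r \sqrt{\alpha} (\Sigma_{\max}/\Sigma_{\min})^2 \max\{ \log n \,;\, \mu_0 r \sqrt{\alpha} (\Sigma_{\max}/\Sigma_{\min})^4 \}$ and $\delta \le \Sigma_{\min}/(C_0 \Sigma_{\max})$. Then,

$$ \| \mathrm{grad}\, \widetilde F(\mathbf{x}) \|^2 \ge C_1 n \epsilon^2 \Sigma_{\min}^4 \Big[ d(\mathbf{x}, \mathbf{u}) - C_2 \frac{\sqrt{r}\, \Sigma_{\max}}{\epsilon\, \Sigma_{\min}} \frac{\| Z^E \|_2}{\Sigma_{\min}} \Big]_+^2 , \tag{17} $$

for all $\mathbf{x} \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that $d(\mathbf{x}, \mathbf{u}) \le \delta$, with probability at least $1 - 1/n^4$. (Here
$[a]_+ \equiv \max(a, 0)$.)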

We can now turn to the proof of our main theorem. 
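
Proof. (Theorem 1.2). Let $\delta = \Sigma_{\min}/(C_0 \Sigma_{\max})$ with $C_0$ large enough so that the hypotheses of Lemmas
4.1 and 4.3 are verified.

Call $\{\mathbf{x}_k\}_{k \ge 0}$ the sequence of pairs $(X_k, Y_k) \in \mathsf{M}(m,n)$ generated by gradient descent. By
assumption, the following is true with a large enough constant C:

$$ \| Z^E \|_2 \le \frac{\epsilon}{C \sqrt{r}} \Big( \frac{\Sigma_{\min}}{\Sigma_{\max}} \Big)^2 \Sigma_{\min} . \tag{18} $$

Further, by using Corollary 4.2 in Eqs. (13) and (14) we get

$$ F(\mathbf{x}) - F(\mathbf{u}) \ge C_1 n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2 \big\{ d(\mathbf{x}, \mathbf{u})^2 - \delta_{0,-}^2 \big\} , \tag{19} $$
$$ F(\mathbf{x}) - F(\mathbf{u}) \le C_2 n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2 \big\{ d(\mathbf{x}, \mathbf{u})^2 + \delta_{0,+}^2 \big\} , \tag{20} $$

where

$$ \delta_{0,-} \equiv C \frac{\sqrt{r}\, \Sigma_{\max}}{\epsilon\, \Sigma_{\min}} \frac{\| Z^E \|_2}{\Sigma_{\min}} , \qquad \delta_{0,+} \equiv C \frac{\sqrt{r}\, \Sigma_{\max}}{\epsilon\, \Sigma_{\min}} \frac{\| Z^E \|_2}{\Sigma_{\max}} . $$

By Eq. (18), we can assume $\delta_{0,+} \le \delta_{0,-} \le \delta/10$.

For $\epsilon \ge C \alpha \mu_1^2 r^2 (\Sigma_{\max}/\Sigma_{\min})^4$ as per our assumptions, using Eq. (18) in Theorem 1.1, together
with the bound $d(\mathbf{u}, \mathbf{x}_0) \le \| M - X_0 S Y_0^T \|_F / (n \sqrt{\alpha}\, \Sigma_{\min})$, we get

$$ d(\mathbf{u}, \mathbf{x}_0) \le \frac{\delta}{10} . $$

We make the following claims:

1. $\mathbf{x}_k \in \mathcal{K}(4\mu_0)$ for all k.

Indeed without loss of generality we can assume $\mathbf{x}_0 \in \mathcal{K}(3\mu_0)$ (because otherwise we can
rescale those lines of $X_0$, $Y_0$ that violate the constraint). Therefore $\widetilde F(\mathbf{x}_0) = F(\mathbf{x}_0) \le
4 C_2 n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2 \delta^2 / 100$. On the other hand $\widetilde F(\mathbf{x}) \ge \rho (e^{1/9} - 1)$ for $\mathbf{x} \notin \mathcal{K}(4\mu_0)$. Since $\widetilde F(\mathbf{x}_k)$ is a
non-increasing sequence, the thesis follows provided we take $\rho \ge C_2 n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2 \delta^2$.

2. $d(\mathbf{x}_k, \mathbf{u}) \le \delta/10$ for all k.

Assuming $\epsilon \ge C \alpha \mu_1^2 r^2 (\Sigma_{\max}/\Sigma_{\min})^6$, we have $d(\mathbf{x}_0, \mathbf{u})^2 \le (\Sigma_{\min}^2 / C' \Sigma_{\max}^2)(\delta/10)^2$. Also
assuming Eq. (18) with large enough C we can show the following. For all $\mathbf{x}_k$ such that
$d(\mathbf{x}_k, \mathbf{u}) \in [\delta/10, \delta]$, we have $\widetilde F(\mathbf{x}) \ge F(\mathbf{x}) \ge F(\mathbf{x}_0)$. This contradicts the monotonicity of
$\widetilde F(\mathbf{x})$, and thus proves the claim.

Since the cost function is twice differentiable, and because of the above, the sequence $\{\mathbf{x}_k\}$
converges to

$$ \Omega = \big\{ \mathbf{x} \in \mathcal{K}(4\mu_0) \cap \mathsf{M}(m,n) : d(\mathbf{x}, \mathbf{u}) \le \delta ,\; \mathrm{grad}\, \widetilde F(\mathbf{x}) = 0 \big\} . $$

By Lemma 4.3 for any $\mathbf{x} \in \Omega$,

$$ d(\mathbf{x}, \mathbf{u}) \le C \frac{\sqrt{r}\, \Sigma_{\max}}{\epsilon\, \Sigma_{\min}} \frac{\| Z^E \|_2}{\Sigma_{\min}} , $$

which implies the thesis using Corollary 4.2. □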
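
4.3 Proof of Lemma 4.1 and Corollary 4.2

Proof. (Lemma 4.1) The proof is based on the analogous bound in the noiseless case, i.e. Lemma
5.3 in [KMO09]. For readers' convenience, the result is reported in Appendix A, Lemma A.1. For
the proof of these lemmas, we refer to [KMO09].

In order to prove the lower bound, we start by noticing that

$$ F(\mathbf{u}) \le \frac{1}{2} \| \mathcal{P}_E(Z) \|_F^2 , $$

which is simply proved by using $S = \Sigma$ in Eq. (3). On the other hand, we have

$$ \begin{aligned}
F(\mathbf{x}) &= \frac{1}{2} \| \mathcal{P}_E(X S Y^T - M - Z) \|_F^2 && (21) \\
&= \frac{1}{2} \| \mathcal{P}_E(Z) \|_F^2 + \frac{1}{2} \| \mathcal{P}_E(X S Y^T - M) \|_F^2 - \langle \mathcal{P}_E(Z), (X S Y^T - M) \rangle && (22) \\
&\ge F(\mathbf{u}) + C n \epsilon \sqrt{\alpha}\, d_-(\mathbf{x}, \mathbf{u})^2 - \sqrt{2r}\, \| Z^E \|_2 \| X S Y^T - M \|_F , && (23)
\end{aligned} $$

where in the last step we used Lemma A.1. Now by triangular inequality

$$ \begin{aligned}
\| X S Y^T - M \|_F^2 &\le 3 \| X (S - \Sigma) Y^T \|_F^2 + 3 \| X \Sigma (Y - V)^T \|_F^2 + 3 \| (X - U) \Sigma V^T \|_F^2 \\
&\le 3 n m \| S - \Sigma \|_F^2 + 3 n^2 \alpha \Sigma_{\max}^2 \Big( \frac{1}{m} \| X - U \|_F^2 + \frac{1}{n} \| Y - V \|_F^2 \Big) \\
&\le C n^2 \alpha\, d_+(\mathbf{x}, \mathbf{u})^2 ,
\end{aligned} $$

which, combined with Eq. (23), yields the lower bound (13).

In order to prove the upper bound, we proceed as above to get

$$ F(\mathbf{x}) \le \frac{1}{2} \| \mathcal{P}_E(Z) \|_F^2 + C n \epsilon \sqrt{\alpha}\, \Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2 + \sqrt{2 r \alpha}\, \| Z^E \|_2\, C n\, d_+(\mathbf{x}, \mathbf{u}) . $$

Further, by replacing $\mathbf{x}$ with $\mathbf{u}$ in Eq. (22)

$$ \begin{aligned}
F(\mathbf{u}) &\ge \frac{1}{2} \| \mathcal{P}_E(Z) \|_F^2 - \langle \mathcal{P}_E(Z), (U (S - \Sigma) V^T) \rangle \\
&\ge \frac{1}{2} \| \mathcal{P}_E(Z) \|_F^2 - \sqrt{2 r \alpha}\, \| Z^E \|_2\, C n\, d_+(\mathbf{x}, \mathbf{u}) .
\end{aligned} $$

By taking the difference of these inequalities we get the desired upper bound. □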

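Proof. (Corollary 4.2) By putting together Eq. (13) and (14), and using the definitions of $d_+(\mathbf{x}, \mathbf{u})$,
$d_-(\mathbf{x}, \mathbf{u})$, we get

$$ \| S - \Sigma \|_F^2 \le \frac{C_2}{C_1} \Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2 + \frac{(C_1 + C_2) \sqrt{r}}{C_1 \epsilon} \| Z^E \|_2 \sqrt{ \Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2 + \| S - \Sigma \|_F^2 } . $$

Without loss of generality, assume $C_2 \ge C_1$, and call $x \equiv \| S - \Sigma \|_F$, $a^2 \equiv (C_2/C_1) \Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2$, and
$b \equiv ((C_1 + C_2) \sqrt{r} / C_1 \epsilon) \| Z^E \|_2$. The above inequality then takes the form

$$ x^2 \le a^2 + b \sqrt{x^2 + a^2} \le a^2 + a b + b x , $$

which implies our claim $x \le a + b$.

The singular value bounds (15) and (16) follow by triangular inequality. For instance

$$ \sigma_{\min}(S) \ge \Sigma_{\min} - C \Sigma_{\max}\, d(\mathbf{x}, \mathbf{u}) - C \frac{\sqrt{r}}{\epsilon} \| Z^E \|_2 , $$

which implies the inequality (16) for $d(\mathbf{x}, \mathbf{u}) \le \delta = \Sigma_{\min}/(C_0 \Sigma_{\max})$ and $C_0$ large enough. An analogous
argument proves Eq. (15). □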






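4.4 Proof of Lemma 4.3

Without loss of generality we will assume $\delta \le 1$, $C_2 \ge 1$ and

$$ \frac{\sqrt{r}}{\epsilon} \| Z^E \|_2 \le \Sigma_{\min} , \tag{24} $$

because otherwise the lower bound (17) is trivial for all $d(\mathbf{x}, \mathbf{u}) \le \delta$.

Denote by $t \mapsto \mathbf{x}(t)$, $t \in [0,1]$, the geodesic on $\mathsf{M}(m,n)$ such that $\mathbf{x}(0) = \mathbf{u}$ and $\mathbf{x}(1) = \mathbf{x}$,
parametrized proportionally to the arclength. Let $\widehat{\mathbf{w}} = \dot{\mathbf{x}}(1)$ be its final velocity, with $\widehat{\mathbf{w}} = (\widehat W, \widehat Q)$.
Obviously $\widehat{\mathbf{w}} \in \mathsf{T}_{\mathbf{x}}$ (with $\mathsf{T}_{\mathbf{x}}$ the tangent space of $\mathsf{M}(m,n)$ at $\mathbf{x}$) and

$$ \frac{1}{m} \| \widehat W \|^2 + \frac{1}{n} \| \widehat Q \|^2 = d(\mathbf{x}, \mathbf{u})^2 , $$

because $t \mapsto \mathbf{x}(t)$ is parametrized proportionally to the arclength.

Explicit expressions for $\widehat{\mathbf{w}}$ can be obtained in terms of $\mathbf{w} \equiv \dot{\mathbf{x}}(0) = (W, Q)$ [KMO09]. If we let
$W = L \Theta R^T$ be the singular value decomposition of W, we obtain

$$ \widehat W = - U R\, \Theta \sin\Theta\, R^T + L\, \Theta \cos\Theta\, R^T . \tag{25} $$

It was proved in [KMO09] that $\langle \mathrm{grad}\, G(\mathbf{x}), \widehat{\mathbf{w}} \rangle \ge 0$. It is therefore sufficient to lower bound the
scalar product $\langle \mathrm{grad}\, F, \widehat{\mathbf{w}} \rangle$. By computing the gradient of F we get

$$ \begin{aligned}
\langle \mathrm{grad}\, F(\mathbf{x}), \widehat{\mathbf{w}} \rangle &= \langle \mathcal{P}_E(X S Y^T - N), (X S \widehat Q^T + \widehat W S Y^T) \rangle \\
&= \langle \mathcal{P}_E(X S Y^T - M), (X S \widehat Q^T + \widehat W S Y^T) \rangle - \langle \mathcal{P}_E(Z), (X S \widehat Q^T + \widehat W S Y^T) \rangle \\
&= \langle \mathrm{grad}\, F_0(\mathbf{x}), \widehat{\mathbf{w}} \rangle - \langle \mathcal{P}_E(Z), (X S \widehat Q^T + \widehat W S Y^T) \rangle && (26)
\end{aligned} $$

where $F_0(\mathbf{x})$ is the cost function in absence of noise, namely

$$ F_0(X, Y) = \min_{S \in \mathbb{R}^{r \times r}} \Big\{ \frac{1}{2} \sum_{(i,j) \in E} \big( (X S Y^T)_{ij} - M_{ij} \big)^2 \Big\} . \tag{27} $$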
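
As proved in [KMO09],

$$ \langle \mathrm{grad}\, F_0(\mathbf{x}), \widehat{\mathbf{w}} \rangle \ge C n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2\, d(\mathbf{x}, \mathbf{u})^2 \tag{28} $$

(see Lemma A.3 in Appendix).

We are therefore left with the task of upper bounding $\langle \mathcal{P}_E(Z), (X S \widehat Q^T + \widehat W S Y^T) \rangle$. Since $X S \widehat Q^T$
has rank at most r, we have

$$ \langle \mathcal{P}_E(Z), X S \widehat Q^T \rangle \le \sqrt{r}\, \| Z^E \|_2\, \| X S \widehat Q^T \|_F . $$

Since $X^T X = m\mathbf{1}$, we get

$$ \begin{aligned}
\| X S \widehat Q^T \|_F^2 &= m\, \mathrm{Tr}(S^T S \widehat Q^T \widehat Q) \le n \alpha\, \sigma_{\max}(S)^2 \| \widehat Q \|_F^2 \\
&\le C n^2 \alpha \Big( \Sigma_{\max} + \frac{\sqrt{r}}{\epsilon} \| Z^E \|_2 \Big)^2 d(\mathbf{x}, \mathbf{u})^2 && (29) \\
&\le 4 C n^2 \alpha\, \Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2 , && (30)
\end{aligned} $$

where, in inequality (29), we used Corollary 4.2 and in the last step, we used Eq. (24). Proceeding
analogously for $\langle \mathcal{P}_E(Z), \widehat W S Y^T \rangle$, we get

$$ \langle \mathcal{P}_E(Z), (X S \widehat Q^T + \widehat W S Y^T) \rangle \le C' n\, \Sigma_{\max} \sqrt{r \alpha}\, \| Z^E \|_2\, d(\mathbf{x}, \mathbf{u}) . $$

Together with Eq. (26) and (28) this implies

$$ \langle \mathrm{grad}\, F(\mathbf{x}), \widehat{\mathbf{w}} \rangle \ge C_1 n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2\, d(\mathbf{x}, \mathbf{u}) \Big\{ d(\mathbf{x}, \mathbf{u}) - C_2 \frac{\sqrt{r}\, \Sigma_{\max}}{\epsilon\, \Sigma_{\min}} \frac{\| Z^E \|_2}{\Sigma_{\min}} \Big\} , $$

which implies Eq. (17) by Cauchy-Schwarz inequality.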



5 Proof of Theorem 1.3

Proof. (Independent entries model)

The proof is analogous to the proof of Lemma 3.2 in [KMO09], where the matrix M to be
reconstructed plays the role of the noise matrix Z. We want to show that $|x^T \widetilde Z^E y| \le C \sigma \sqrt{\alpha \epsilon \log |E|}$
for all $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$ such that $\|x\| = 1$ and $\|y\| = 1$. This will be done by first reducing
ourselves to x and y belonging to finite discretized sets. We define

$$ T_n \equiv \Big\{ x \in \Big\{ \frac{\Delta}{\sqrt{n}} \mathbb{Z} \Big\}^n : \|x\| \le 1 \Big\} . $$

Notice that $T_n \subseteq S_n \equiv \{ x \in \mathbb{R}^n : \|x\| \le 1 \}$. The next remark is proved in [FKS89, FO05], and
relates the original problem to the discretized one.

Remark 5.1. Let $R \in \mathbb{R}^{m \times n}$ be a matrix. If $|x^T R y| \le B$ for all $x \in T_m$ and $y \in T_n$, then
$|x'^T R y'| \le (1 - \Delta)^{-2} B$ for all $x' \in S_m$ and $y' \in S_n$.
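
To see why a grid of spacing $\Delta/\sqrt{n}$ is fine enough, note that any vector in the unit ball can be moved to a grid point of $T_n$ at Euclidean distance at most $\Delta$. The sketch below rounds each coordinate toward zero, which keeps the norm at most one; the function name is hypothetical, and this rounding is only an illustration of the discretization, not the argument used in [FKS89, FO05].

```python
import math

def round_to_grid(x, delta):
    """Round each coordinate of x toward zero to a multiple of delta / sqrt(n)."""
    n = len(x)
    step = delta / math.sqrt(n)
    # Truncation never increases |x_i|, so the rounded vector stays in the unit ball.
    return [math.trunc(v / step) * step for v in x]
```

Each coordinate moves by less than $\Delta/\sqrt{n}$, so the total displacement is less than $\Delta$.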

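Hence it is enough to show that, with high probability, $|x^T \widetilde Z^E y| \le C \sigma \sqrt{\alpha \epsilon \log n}$ for all $x \in T_m$
and $y \in T_n$. Given $x \in S_m$, $y \in S_n$, we define the set of light couples L as,

$$ L \equiv \Big\{ (i,j) : |x_i y_j| \le \sqrt{\epsilon / 2mn} \Big\} , $$

and let $\overline{L}$ be its complement, which we call heavy couples. In the following, we will prove that for
any $x \in T_m$ and any $y \in T_n$, with high probability, the contribution of light couples, $\sum_{(i,j) \in L} x_i \widetilde Z^E_{ij} y_j$, is
bounded by $C \sigma \sqrt{\alpha \epsilon}$ and the contribution of heavy couples, $\sum_{(i,j) \in \overline{L}} x_i \widetilde Z^E_{ij} y_j$, is bounded by $C \sigma \sqrt{\alpha \epsilon \log |E|}$.
Together with Remark 5.1, this proves the desired thesis.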
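
The decomposition of the bilinear form into light and heavy couples can be made concrete as follows. The sketch assumes the light couples are those with $|x_i y_j| \le \sqrt{\epsilon/(2mn)}$; the function name and the dense representation are choices made for the example.

```python
import numpy as np

def split_light_heavy(x, y, Z, eps):
    """Split x^T Z y into the contributions of light and heavy couples."""
    m, n = Z.shape
    products = np.outer(x, y)
    # Light couples: |x_i y_j| is at most sqrt(eps / (2 m n)) (assumed threshold).
    light = np.abs(products) <= np.sqrt(eps / (2 * m * n))
    light_part = float(np.sum(products * Z * light))
    heavy_part = float(np.sum(products * Z * ~light))
    return light_part, heavy_part, light
```

The two parts always add up to $x^T Z y$; the proof bounds them separately because the light part concentrates well, while the heavy part involves few couples.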



Bounding the contribution of light couples 

Let us define the subset of row and column indices which have not been trimmed as $\mathcal{A}_l$ and $\mathcal{A}_r$. Notice
that $\mathcal{A} = (\mathcal{A}_l, \mathcal{A}_r)$ is a function of the random set E. For any $E \subseteq [m] \times [n]$ and $A = (A_l, A_r)$ with
$A_l \subseteq [m]$, $A_r \subseteq [n]$, we define $Z^{E,A}$ by setting to zero the entries of Z that are not in E, those whose
row index is not in $A_l$, and those whose column index is not in $A_r$. Let $X_L^{E,A} \equiv \sum_{(i,j) \in L} x_i Z^{E,A}_{ij} y_j$ be
the contribution of the light couples, then we need to bound the error event:

$$ \mathcal{H}(E, A) \equiv \Big\{ \exists\, (x, y) \in T_m \times T_n : |X_L^{E,A}| \ge C \sigma \sqrt{\epsilon} \Big\} . \tag{31} $$

Note that $\widetilde Z^E = Z^{E,\mathcal{A}}$, and hence we want to bound $\mathbb{P}\{\mathcal{H}(E, \mathcal{A})\}$. We use the following remark,
which is a slight modification of the proof for bounding the contribution of light couples given in
[KMO09].

Remark 5.2. There exist constants $C_1$ and $C_2$ depending only on $\alpha$ such that the following is true.

$$ \mathbb{P}\{ \mathcal{H}(E, \mathcal{A}) \} \le 2^{(n+m) H(\delta)} \max_{\substack{|A_l| \ge m(1-\delta), \\ |A_r| \ge n(1-\delta)}} \mathbb{P}\{ \mathcal{H}(E; A) \} + \frac{1}{n^4} , \tag{32} $$

with $\delta \equiv \max\{ e^{-C_1 \epsilon}, C_2 \alpha \}$ and H(x) the binary entropy function.

Now we are left with the task of bounding $\mathbb{P}\{\mathcal{H}(E; A)\}$ by a concentration of measure inequality
for the sub-gaussian random matrix Z.

Lemma 5.3. For any $x \in S_m$ and $y \in S_n$, $\mathbb{P}\big( X_L^{E,A} \ge C \sigma \sqrt{\epsilon} \big) \le \exp\{ -(C - 2) \sqrt{\alpha}\, n \}$.

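Proof. We will bound the probability using a Chernoff bound. To this end, we will first upper bound
$\mathbb{E}[\exp(\lambda X_L^{E,A})]$ as below. Note that for sub-gaussian $Z_{ij}$ we have $\mathbb{E}[e^{\lambda Z_{ij}}] \le e^{\lambda^2 \sigma^2}$, whence

$$ \begin{aligned}
\mathbb{E}\big[ \exp( \lambda X_L^{E,A} ) \big] &\le \mathbb{E}\Big[ \prod_{(i,j) \in E \cap L,\; i \in A_l,\; j \in A_r} \exp( \lambda^2 x_i^2 y_j^2 \sigma^2 ) \Big] && (33) \\
&\le \prod_{(i,j) \in L} \Big( 1 + \frac{\epsilon}{\sqrt{mn}}\, 2 \lambda^2 x_i^2 y_j^2 \sigma^2 \Big) && (34) \\
&\le \exp\Big\{ \frac{\epsilon}{\sqrt{mn}} \sum_{(i,j)} 2 \lambda^2 x_i^2 y_j^2 \sigma^2 \Big\} \le \exp\Big\{ \frac{2 \lambda^2 \sigma^2 \epsilon}{\sqrt{mn}} \Big\} . && (35)
\end{aligned} $$

Equation (34) follows by the fact that $\exp(a) \le 1 + 2a$ for $0 \le a \le 1/2$, which is ensured by choosing
$\lambda = \frac{1}{\sigma} \sqrt{\frac{mn}{\epsilon}}$. The thesis follows from Chernoff bound after some calculus.

□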



We can now finish the upper bound on the light couples contribution. Consider the error event
in Eq. (31). A simple volume calculation shows that $|T_m| \le (10/\Delta)^m$. We can apply union bound
over $T_m$ and $T_n$ to Eq. (32) to obtain

$$ \mathbb{P}\{ \mathcal{H}(E, \mathcal{A}) \} \le \exp\big\{ \log 2 + (1 + \alpha) \big( H(\delta) \log 2 + \log(20/\Delta) \big) n - (C - 2) \sqrt{\alpha}\, n \big\} + \frac{1}{n^4} . $$

Hence, assuming $\alpha \ge 1$, there exists a numerical constant C' such that, for $C \ge C' \sqrt{\alpha}$, the first term
is of order $e^{-\Theta(n)}$, and this finishes the proof.



Bounding the contribution of heavy couples 



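The heavy couples are bounded by $\big| \sum_{(i,j) \in \overline{L}} x_i \widetilde Z^E_{ij} y_j \big| \le Z^E_{\max} \sum_{(i,j) \in \overline{L}} Q_{ij} |x_i y_j|$, where
$Z^E_{\max} \equiv \max_{(i,j) \in E} |Z_{ij}|$ and Q is the adjacency matrix of the underlying graph. Using the following
remark, we get that $\big| \sum_{(i,j) \in \overline{L}} x_i \widetilde Z^E_{ij} y_j \big| \le C' Z^E_{\max} \sqrt{\alpha \epsilon}$ with probability larger than $1 - 1/2n^3$.

Remark 5.4. Given vectors $x \in T_m$ and $y \in T_n$, let $\overline{L} = \{ (i,j) : |x_i y_j| \ge C \sqrt{\epsilon/mn} \}$. Then there
exists a constant C' such that, $\sum_{(i,j) \in \overline{L}} Q_{ij} |x_i y_j| \le C' \sqrt{\alpha \epsilon}$, with probability larger than $1 - 1/2n^3$.

For the proof of this remark we refer to [KMO09]. Further, for Z drawn from the independent
model, we have

$$ \mathbb{P}\big( Z^E_{\max} \ge L \sigma \sqrt{\log |E|} \big) \le 2 |E|^{1 - L^2/2} . $$

For L larger than 4, we get the desired thesis. □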

Proof. (Worst Case Model)

Let D be the $m \times n$ all-ones matrix. Then for any matrix Z from the worst case model, we have
$\| \widetilde Z^E \|_2 \le Z_{\max} \| \widetilde D^E \|_2$, since $x^T \widetilde Z^E y \le \sum_{i,j} Z_{\max} |x_i| \widetilde D^E_{ij} |y_j|$, which follows from the fact that the $Z_{ij}$'s
are uniformly bounded. Further, $\widetilde D^E$ is an adjacency matrix of a corresponding bipartite graph with
bounded degrees. Then, for any choice of E the following is true for all positive integers k:

$$ \| \widetilde D^E \|_2^{2k} \le \max_{x, \|x\| = 1} \big| x^T \big( (\widetilde D^E)^T \widetilde D^E \big)^k x \big| \le \mathrm{Tr}\big( ( (\widetilde D^E)^T \widetilde D^E )^k \big) \le n (2\epsilon)^{2k} . $$

Taking k large, we get the desired thesis. □
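
A Three Lemmas on the Noiseless Problem

Lemma A.1. There exist numerical constants $C_0, C_1, C_2$ such that the following happens. Assume
$\epsilon \ge C_0 \mu_0 r \sqrt{\alpha} \max\{ \log n \,;\, \mu_0 r \sqrt{\alpha} (\Sigma_{\min}/\Sigma_{\max})^4 \}$ and $\delta \le \Sigma_{\min}/(C_0 \Sigma_{\max})$. Then,

$$ C_1 \sqrt{\alpha}\, \Sigma_{\min}^2\, d(\mathbf{x}, \mathbf{u})^2 + C_1 \sqrt{\alpha}\, \| S_0 - \Sigma \|_F^2 \le \frac{1}{n \epsilon} F_0(\mathbf{x}) \le C_2 \sqrt{\alpha}\, \Sigma_{\max}^2\, d(\mathbf{x}, \mathbf{u})^2 , $$

for all $\mathbf{x} \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that $d(\mathbf{x}, \mathbf{u}) \le \delta$, with probability at least $1 - 1/n^4$. Here $S_0 \in \mathbb{R}^{r \times r}$
is the matrix realizing the minimum in Eq. (27).

Lemma A.2. There exist numerical constants $C_0$ and C such that the following happens. Assume
$\epsilon \ge C_0 \mu_0 r \sqrt{\alpha} (\Sigma_{\max}/\Sigma_{\min})^2 \max\{ \log n \,;\, \mu_0 r \sqrt{\alpha} (\Sigma_{\max}/\Sigma_{\min})^4 \}$ and $\delta \le \Sigma_{\min}/(C_0 \Sigma_{\max})$. Then

$$ \| \mathrm{grad}\, F_0(\mathbf{x}) \|^2 \ge C n \epsilon^2 \Sigma_{\min}^4\, d(\mathbf{x}, \mathbf{u})^2 , $$

for all $\mathbf{x} \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that $d(\mathbf{x}, \mathbf{u}) \le \delta$, with probability at least $1 - 1/n^4$.

Lemma A.3. Define $\widehat{\mathbf{w}}$ as in Eq. (25). Then there exist numerical constants $C_0$ and C such that
the following happens. Under the hypothesis of Lemma A.2,

$$ \langle \mathrm{grad}\, F_0(\mathbf{x}), \widehat{\mathbf{w}} \rangle \ge C n \epsilon \sqrt{\alpha}\, \Sigma_{\min}^2\, d(\mathbf{x}, \mathbf{u})^2 , $$

for all $\mathbf{x} \in \mathsf{M}(m,n) \cap \mathcal{K}(4\mu_0)$ such that $d(\mathbf{x}, \mathbf{u}) \le \delta$, with probability at least $1 - 1/n^4$.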